9/22/2020

Agenda

  • Why not linear regression?
  • Logistic regression
  • Multinomial logistic regression
  • Evaluating accuracy
  • Bayes theorem, LDA, QDA
  • k Nearest Neighbours

Introduction

The classification problem

When the \(Y\) we are trying to predict is categorical (or qualitative), we say we have a classification problem.

Qualitative variables take values in an unordered set \(\mathcal{C}\), such as: \(\text{eye color} \in \{\text{brown, blue, green}\}, \text{email} \in \{\text{spam, not spam}\}\).

  • Given a feature vector \(\mathbf{X}\) and a qualitative response \(Y\) taking values in the set \(\mathcal{C}\), the classification task is to build a function \(C(\mathbf{X})\) that takes as input the feature vector \(\mathbf{X}\) and predicts its value for \(Y\), i.e. \(C(\mathbf{X}) \in \mathcal{C}\)
  • Often we are more interested in estimating the probabilities that \(\mathbf{X}\) belongs to each category in \(\mathcal{C}\)

The classification problem

Given the probability we can classify using a rule like “guess Spam” if \(\mathbb{P}(Y = \text{Spam} \mid x) > 0.5\).

There are a large number of methods for classification:

  • we have seen k-NN in the first class
  • we will now learn about logistic regression and Bayes classifiers
  • later, we will study trees and Random Forests

The classification problem

  • Remember that the goal in regression is to find the relationship \[f(\mathbf{X}) = \mathbb{E}[Y \mid \mathbf{X}]\]
  • Many classification problems have instead the goal of finding \[f(y \mid \mathbf{x}) = \mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x})\]
  • Other classification methods just try to assign a \(Y\) to a category given \(\mathbf{X} = x\)

Linear probability model

Default data

We will start with binary classification, i.e. the case \(Y \in \{0, 1\} = \{\text{yes, no}\}\) and we want to find the relationship \[f(y = 1 \mid \mathbf{x}) = \mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x}).\]


This also gives us \(\mathbb{P}(Y = 0 \mid \mathbf{X}) = 1 - \mathbb{P}(Y = 1 \mid \mathbf{X})\).

##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

Default data

Default data:

  • \(Y\): whether or not a customer defaults on their credit card (No or Yes)
  • \(X\): average balance that the customer has remaining on their credit card
  • 10,000 observations, 333 defaults (\(3.33\%\) default rate).

A simple approach

Can we simply perform a linear regression of \(Y\) on \(X\) and classify as “Yes” if \(\hat{Y} > 0.5\)?

  • Since in the population \(\mathbb{E}(Y \mid X = x) = \mathbb{P}(Y = 1 \mid X = x)\), we might think that regression is perfect for this task
  • Suppose, like in linear regression, that \[\mathbb{P} (y_i = 1 \mid x_i) = f(x_i) = \beta_0 + \beta_1 x_i\]

This is called the linear probability model: the probability of a “yes” outcome (\(y = 1\)) is linear in \(x_i\).

A simple approach

  • However, linear regression might produce probabilities that fall outside \((0,1)\) because \[\mathbb{P} (y_i = 1 \mid x_i) = f(x_i) = \beta_0 + \beta_1 x_i\]
  • Logistic regression is more appropriate
  • This gets even worse in multiple class scenarios

The problem is:

  • the left-hand side must fall between 0 and 1, by the basic rules of probability
  • but the right-hand side is unconstrained - it can be any real number (see the sketch below)
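A minimal sketch of this issue in R, assuming the Default data from the ISLR package and a hypothetical 0/1 coding numeric_def of the response (the variable name matches the glm output shown later):

```r
library(ISLR)  # provides the Default data set

# Hypothetical 0/1 coding of the response (an assumption, not shown in the slides)
Default$numeric_def <- ifelse(Default$default == "Yes", 1, 0)

# Linear probability model: regress the 0/1 response directly on balance
lpm <- lm(numeric_def ~ balance, data = Default)

# Fitted values are not constrained to [0, 1]; for these data the smallest are negative
range(fitted(lpm))
```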

Multiple classes

Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.

\[Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\]

  • This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure
  • Linear regression is not appropriate here: Multiclass Logistic Regression or Discriminant Analysis are more appropriate

Logistic regression

The Logistic Model

Logistic regression uses the power of linear modeling and estimates \(\mathbb{P}(Y = y \mid x)\) using a two-step process:

  1. apply a linear function to \(x\): \(x \rightarrow \eta = \beta_0 + \beta_1 x\)
  2. apply the logistic function \(F\) to \(\eta\) to get a number between \(0\) and \(1\), \(\mathbb{P}(Y = 1 \mid x) = F(\eta) = F(\beta_0 + \beta_1 x)\)

\(F\) is called the link function; a standard choice is \[F(\eta) = \dfrac{e^\eta}{1 + e^\eta} = \dfrac{1}{1 + e^{-\eta}}\]

  • At \(\eta = 0\), \(F(\eta) = 0.5\)
  • When \(\eta \rightarrow +\infty\), \(F(\eta) \rightarrow 1\) and when \(\eta \rightarrow -\infty\), \(F(\eta) \rightarrow 0\)

The Logistic Model

The key idea is that \(F(\eta)\) is always between \(0\) and \(1\) so we can use it as a probability. Note that \(F\) is increasing, so if \(\eta\) goes up \(\mathbb{P}(Y = 1 \mid x)\) goes up.
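A tiny R sketch of this function, just to see the values it produces (the function name logistic is mine):

```r
# Logistic function: maps any real eta into the interval (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

logistic(c(-5, 0, 5))  # roughly 0.007, 0.5, 0.993: increasing and bounded in (0, 1)
```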

The Logistic Model

\[F(\eta) = \dfrac{e^\eta}{1 + e^\eta} = \dfrac{1}{1 + e^{-\eta}}\]

This is called the “logistic” or “logit” link, and it leads to the logistic regression model \[\mathbb{P}(Y_i = 1 \mid x_i) = \dfrac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\]

This is a very common choice of link function, for a couple of good reasons. One is interpretability: a little algebra shows that

\[\log \left( \dfrac{p}{1-p} \right) = \beta_0 + \beta_1 x_i \Rightarrow \dfrac{p}{1-p} = e^{\beta_0 + \beta_1 x_i}\]

so that it is a log-linear model for the odds of a “yes” outcome.
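Spelling out that algebra, with \(p = \mathbb{P}(Y_i = 1 \mid x_i)\) and \(\eta = \beta_0 + \beta_1 x_i\):

\[p = \dfrac{e^{\eta}}{1 + e^{\eta}} \;\Rightarrow\; 1 - p = \dfrac{1}{1 + e^{\eta}} \;\Rightarrow\; \dfrac{p}{1-p} = e^{\eta} \;\Rightarrow\; \log\left(\dfrac{p}{1-p}\right) = \eta = \beta_0 + \beta_1 x_i\]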

Parameter interpretation

\[\text{Odds} = \dfrac{p}{1-p} = e^{\beta_0} \cdot e^{\beta_1 x_i}\]

So \(e^{\beta_1}\) is an odds multiplier (or odds ratio) for a one-unit increase in the feature \(x\).

Say that \(\beta_1 = 1\), i.e. \(e^{\beta_1} \approx 2.72\).

If \(p = 0.2\), the odds are \(\dfrac{p}{1-p} = 0.25\). A one-unit increase for \(x\) gives a probability of \(p^{\prime} = 0.4\).

If \(p = 0.8\), the odds are \(\dfrac{p}{1-p} = 4\). A one-unit increase for \(x\) gives a probability of \(p^{\prime} = 0.92\).
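The two worked examples above, reproduced as a small R sketch (the helper name new_prob is mine):

```r
# Multiply the odds by exp(beta1) and convert back to a probability
new_prob <- function(p, beta1 = 1) {
  odds     <- p / (1 - p)        # current odds
  new_odds <- odds * exp(beta1)  # odds after a one-unit increase in x
  new_odds / (1 + new_odds)      # back to a probability
}

new_prob(0.2)  # approximately 0.40
new_prob(0.8)  # approximately 0.92
```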

Parameter estimation

Logistic regression gives us a formal parametric statistical model (like linear regression with normal errors)

\[Y_i \sim \text{Bernoulli}(p_i), \quad p_i = F(\beta_0 + \beta_1 x_i)\]

To estimate the parameters \(\beta_0\) and \(\beta_1\), we use maximum likelihood. That is, we choose the parameter values that make the data we have seen most likely.

\[L(\beta_0, \beta_1) = \prod_{i =1}^n \left\{ p_i^{y_i} (1 - p_i)^{1 - y_i} \right\}\]

This likelihood gives the probability of the observed zeros and ones in the data. In R we use the glm function.
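A sketch of the fit in R. The Call shown in the output on the next slide is glm(numeric_def ~ balance, family = "binomial", data = Default); the construction of the 0/1 variable numeric_def is my assumption about how the data were prepared:

```r
library(ISLR)  # Default data
Default$numeric_def <- ifelse(Default$default == "Yes", 1, 0)  # assumed 0/1 coding

# Maximum likelihood fit of the logistic regression of default on balance
fit <- glm(numeric_def ~ balance, family = "binomial", data = Default)
summary(fit)
```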

Maximum likelihood review

Maximum likelihood is a method for determining the optimal values of a model's parameters: the parameter values are chosen so as to maximize the likelihood that the model produced the data that were actually observed.

  • Let us suppose we have observed 10 data points
  • We assume that the data generation process can be described by a normal distribution
  • The normal distribution has 2 parameters \((\mu, \sigma)\). Different values of these parameters result in different curves

Maximum likelihood review

Which curve was most likely responsible for creating the data points that we observed?

Maximum likelihood review

Same for linear regression and same here!
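A minimal sketch of the same idea for the 10-point example above, holding \(\sigma\) fixed at 1 for simplicity (the data here are made up purely for illustration):

```r
set.seed(1)
x <- rnorm(10, mean = 3, sd = 1)   # 10 made-up observations

# Log-likelihood of a N(mu, 1) model for the observed data
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

# Evaluate on a grid of candidate means and keep the best one
mu_grid <- seq(0, 6, by = 0.01)
mu_hat  <- mu_grid[which.max(sapply(mu_grid, loglik))]
mu_hat   # essentially mean(x): the sample mean is the MLE of mu here
```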

Making predictions

## 
## Call:
## glm(formula = numeric_def ~ balance, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
  • The MLE is \(\hat{\beta}_0 = −10.65, \hat{\beta}_1 = 0.0055\)
  • Given \(x = 2000\), \(\eta = \hat{\beta}_0 + \hat{\beta}_1 x = 0.35\)
  • \(\begin{aligned} &\mathbb{P}(Y = 1 \mid x = 2000) \\ &\ = F(0.35) = 0.59 \end{aligned}\)
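In R, the same prediction can be obtained with predict (a sketch, assuming the fitted object fit from the earlier glm call):

```r
# Probability of default for a customer with a balance of 2000
predict(fit, newdata = data.frame(balance = 2000), type = "response")  # about 0.59

# Equivalently, by hand: linear predictor, then the logistic function
eta <- coef(fit)["(Intercept)"] + coef(fit)["balance"] * 2000
exp(eta) / (1 + exp(eta))
```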

Coefficient interpretation

## 
## Call:
## glm(formula = numeric_def ~ balance, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
  • \(\hat{\beta}_1 > 0\) suggests that larger balances are associated with higher risk of default
  • \(e^{\beta_1}\) is an odds multiplier or odds ratio for a one-unit increase in the feature \(x\)
  • \(\hat{\beta}_1 = 0.0055\), so an extra \(\$1000\) of balance raises the log-odds of default by \(5.5\), i.e. multiplies the odds of default by \(e^{5.5} \approx 245\) (see the sketch below)
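A sketch of these odds ratios pulled from the fitted model (assuming fit as above):

```r
# Odds multiplier for a one-unit (one dollar) increase in balance
exp(coef(fit)["balance"])          # about 1.0055

# Odds multiplier for a $1000 increase in balance
exp(1000 * coef(fit)["balance"])   # about e^5.5, roughly 245
```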

Testing significance

## 
## Call:
## glm(formula = numeric_def ~ balance, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
  • Confidence Interval for \(\beta_1\): \(\begin{aligned} &\hat{\beta}_1 \pm 2 \text{SE}(\beta_1) \\ = &0.0055 \pm 2(0.00022)\end{aligned}\)
  • Test \(H_0: \beta_1 = \beta_1^0\) with \(z = \dfrac{\hat{\beta}_1 - \beta_1^0}{\text{SE}(\hat{\beta}_1)}\). If \(H_0\) is true, \(z\) should look like a draw from a standard normal. Here \(z = \dfrac{0.0055 - 0}{0.00022} \approx 25\), far too large for a standard normal, so we reject the null hypothesis that \(\beta_1 = 0\) (see the sketch below)
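A sketch of the same quantities computed from the fitted model (assuming fit as above):

```r
est <- coef(summary(fit))["balance", "Estimate"]
se  <- coef(summary(fit))["balance", "Std. Error"]

c(est - 2 * se, est + 2 * se)  # rough 95% confidence interval for beta_1
est / se                       # z statistic for H0: beta_1 = 0, about 25
```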

Other R output

  • Fisher Scoring iterations: it took 8 iterations for the optimization method to converge
  • Deviance: \(-2 \log(L(\beta_0, \beta_1))\):
    • Residual deviance: deviance you get by plugging the MLE \((\hat{\beta}_0, \hat{\beta}_1)\) into the likelihood
    • Null deviance is what you get by setting \(\beta_1 = 0\) and then optimizing the likelihood over \(\beta_0\) alone

A big likelihood is good, so a small deviance is good

When performing model selection, the deviance can be used as an out-of-sample loss function (just as we use the sum of squared errors for a numeric \(Y\)).
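A sketch of reading both deviances off fitted models (assuming fit as above, plus an intercept-only null model fit0):

```r
fit0 <- glm(numeric_def ~ 1, family = "binomial", data = Default)  # intercept only

deviance(fit)                  # residual deviance, about 1596.5
deviance(fit0)                 # null deviance, about 2920.6
-2 * as.numeric(logLik(fit))   # equals the residual deviance for 0/1 data
```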

Multiple Logistic Regression

We can extend our logistic model to several numeric \(x\) by letting \(\eta\) be a linear combination of the \(x\)’s instead of just a linear function of one \(x\).

\[\log \left( \dfrac{p(\mathbf{X})}{1 - p(\mathbf{X})} \right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\] \[ p(\mathbf{X}) = \dfrac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}\]

Default data

## 
## Call:
## glm(formula = numeric_def ~ balance + student, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

The probability of default increases with the balance, but at any fixed balance a student is less likely to default.

Default data

## 
## Call:
## glm(formula = numeric_def ~ student, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2970  -0.2970  -0.2434  -0.2434   2.6585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
## studentYes   0.40489    0.11502    3.52 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2908.7  on 9998  degrees of freedom
## AIC: 2912.7
## 
## Number of Fisher Scoring iterations: 6

Here the coefficient for the student dummy is positive, suggesting that a student is more likely to default.

Default data

## 
## Call:
## glm(formula = numeric_def ~ student + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

But, in the multiple logistic regression, the coefficient for student is negative (students are less likely to default at any fixed level of balance)

Confounding

Same intuition as in multiple linear regression: we know that when the \(x\)’s are correlated, the coefficients of the existing \(x\)’s can change when we add new \(x\)’s to the model.

Are balance and student “correlated”?

Confounding

  • Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students: if all you know is that a credit card holder is a student, then they are more likely to have a larger balance and hence more likely to default
  • But for each level of balance, students default less than non-students: if you know the balance, a student is less likely to default than a non-student

Multiple logistic regression can tease this out; a quick check in the data is sketched below.
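A sketch of that check (assuming the Default data prepared as in the earlier sketches):

```r
# Students carry higher balances on average ...
tapply(Default$balance, Default$student, mean)

# ... and, marginally, they also default more often
tapply(Default$numeric_def, Default$student, mean)
```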

Multinomial logistic regression

Logistic regression with more than two classes

Also referred to as multinomial regression.

So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

\[\mathbb{P}(Y = k \mid X = x) = \dfrac{e^{\beta_{0k} + \beta_{1k} x_{1} + \dots + \beta_{pk} x_p}}{\sum_{l=1}^K e^{\beta_{0l} + \beta_{1l} x_{1} + \dots + \beta_{pl} x_p}}\]

Here there is a linear function for each class. Note, some cancellation is possible, and only \(K - 1\) linear functions are needed.
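A hedged sketch of fitting a multinomial logistic regression in R. One convenient option is nnet::multinom, which uses a baseline-category parameterization (the \(K - 1\) linear functions mentioned above) rather than glmnet's symmetric form; the iris data serve only as a stand-in three-class example:

```r
library(nnet)

# Three-class response: setosa / versicolor / virginica
fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

# Predicted class probabilities for the first few observations
head(predict(fit_multi, type = "probs"))
```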

Evaluating accuracy

Types of errors

Confusion matrix:

##     true
## pred    0    1
##    0 9628  228
##    1   39  105

Counts on the diagonal are successes, while the off-diagonal entries represent the two kinds of errors you can make:

  • False positive rate: the fraction of negative examples that are classified as positive - \(\frac{39}{39+9628} \approx 0.4\%\) in this example
  • False negative rate: the fraction of positive examples that are classified as negative - \(\frac{228}{228+105} \approx 68.5\%\) in this example

Types of errors

Confusion matrix:

##     true
## pred    0    1
##    0 9628  228
##    1   39  105

Total accuracy: sum of the elements on the diagonal divided by total.

“Positive” and “negative” refer to the outcome predicted by the model.
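A sketch of computing these quantities in R from the two-predictor model (the object name fit2 is mine; data prepared as in the earlier sketches):

```r
fit2 <- glm(numeric_def ~ balance + student, family = "binomial", data = Default)

pred <- ifelse(fitted(fit2) >= 0.5, 1, 0)        # classify as "Yes" when P >= 0.5
tab  <- table(pred = pred, true = Default$numeric_def)
tab

sum(diag(tab)) / sum(tab)         # total accuracy
tab["1", "0"] / sum(tab[, "0"])   # false positive rate
tab["0", "1"] / sum(tab[, "1"])   # false negative rate
```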

Types of errors

We produced this table by classifying to class “Yes” if \[\mathbb{P}(\text{Default = Yes} \mid \text{Balance, Student}) \geq 0.5\]

We can change the two error rates by replacing the threshold \(0.5\) with some other value \(s \in [0, 1]\), i.e. classifying to “Yes” if \[\mathbb{P}(\text{Default = Yes} \mid \text{Balance, Student}) \geq s\] and varying \(s\).

  • \(s \rightarrow 1\): more conservative in predicting \(Y = 1\). More false negatives, but fewer false positives
  • \(s \rightarrow 0\): more conservative in predicting \(Y = 0\). More false positives, but fewer false negatives

Types of errors

In order to reduce the false negative rate, we may want to reduce the threshold to \(0.1\) or less.
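For instance, a sketch of lowering the threshold to \(0.1\) (fit2 defined as in the earlier sketch):

```r
fit2 <- glm(numeric_def ~ balance + student, family = "binomial", data = Default)

# Lower the threshold s from 0.5 to 0.1: more aggressive in predicting default
pred_01 <- ifelse(fitted(fit2) >= 0.1, 1, 0)
table(pred = pred_01, true = Default$numeric_def)
# expect fewer false negatives than with s = 0.5, at the cost of more false positives
```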

ROC curve

ROC (receiver operating characteristic) curve: plots the true positive rate against the false positive rate for varying levels of \(s\).

The ROC plot displays both simultaneously. Sometimes we use the AUC or area under the curve to summarize the overall performance (higher AUC is good).

The diagonal line is drawn at \(y = x\). It represents the performance you would get from random guessing, i.e. if \(Y\) were independent of \(\mathbf{X}\).
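A hedged sketch of drawing the ROC curve and computing the AUC with the pROC package (one of several options; fit2 as in the earlier sketch):

```r
library(pROC)

fit2 <- glm(numeric_def ~ balance + student, family = "binomial", data = Default)

roc_obj <- roc(response = Default$numeric_def, predictor = fitted(fit2))
plot(roc_obj)   # ROC curve (sensitivity vs. specificity across all thresholds)
auc(roc_obj)    # area under the curve; closer to 1 is better
```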

Question time